2017-05-17
ismayc, @old_man_chester
rudeboybert, @rudeboybert
LINK TO DOCUMENT
tidyverse, then stats.
Actual dialogue I had with a student:
Cobb (TAS 2015): Minimizing prerequisites to research. In other words, focus on the entirety of Wickham/Grolemund's pipeline…
… and not just this part.
Furthermore, use the data science tools that a data scientist would use. Example: tidyverse
What does this buy us?
nycflights13 and fivethirtyeight
Cobb (TAS 2015): Two possible "computational engines" for statistics, in particular relating to sampling:
We present students with a choice for our "engine":
| Either we use this… | Or we use this… |
|---|---|
What does this buy us?
Why should we do this?
Insert appropriate image
DataCamp offers an interactive, browser-based tool for learning R/Python. Their two flagship R courses, both of which are free:
Outsource many essential but not fun to teach topics like
ggplot2 package and knowledge of the Grammar of Graphics primes students for regression
broom package to unpack regression
ggplot2 Primes Regression
Example:
This involves four variables: carrier, temp, dep_delay, summer
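The plot from this slide is not reproduced here. As a rough sketch of how four variables can share one ggplot2 graphic, the code below maps temp to x, dep_delay to y, summer to color, and carrier to facets. The choice of carriers (AS and F9), the join to the weather table, and the aesthetic mapping are all assumptions made for illustration, not a reconstruction of the original figure.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Assumption: restrict to two carriers (AS and F9) to keep the plot readable
carrier_weather <- flights %>%
  filter(carrier %in% c("AS", "F9")) %>%
  # Attach the hourly temperature recorded at the origin airport
  inner_join(
    weather %>% select(origin, time_hour, temp),
    by = c("origin", "time_hour")
  ) %>%
  filter(!is.na(dep_delay), !is.na(temp)) %>%
  # summer is TRUE for June, July, and August
  mutate(summer = month %in% 6:8)

# Four variables in one graphic: x, y, color, and facet
delay_plot <- ggplot(
  data = carrier_weather,
  mapping = aes(x = temp, y = dep_delay, color = summer)
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ carrier)
delay_plot
```

Each aesthetic here plays the role that a term would play in a regression formula, which is the sense in which the Grammar of Graphics sets students up for modeling.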
ggplot2 Primes Regression
Why? Dig deeper into data. Look at origin and dest variables as well:
| carrier | origin | dest | Number of Flights |
|---|---|---|---|
| AS | EWR | SEA | 712 |
| F9 | LGA | DEN | 675 |
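A table of this shape can be sketched with dplyr's count(). The snippet assumes the table is restricted to carriers AS and F9 in nycflights13::flights; the column name n_flights is illustrative, and the exact counts depend on any filtering applied beforehand (for example, dropping rows with missing delays), so no specific numbers are claimed.

```r
library(dplyr)
library(nycflights13)

# Count flights for each carrier/origin/destination combination
route_counts <- flights %>%
  filter(carrier %in% c("AS", "F9")) %>%
  count(carrier, origin, dest, name = "n_flights")

route_counts
```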
broom Package
broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
tidyverse ecosystem
broom Package
In our case, broom functions take lm objects as inputs and return the following in tidy format!
tidy(): regression output table
augment(): point-by-point values (fitted values, residuals, predicted values)
glance(): scalar summaries like \(R^2\),
broom Package
The chapter will be built around this code:
    library(ggplot2)
    library(dplyr)
    library(nycflights13)
    library(knitr)
    library(broom)
    set.seed(2017)

    # Load Alaska data, deleting rows that have missing departure delay
    # or arrival delay data
    alaska_flights <- flights %>%
      filter(carrier == "AS") %>%
      filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
      sample_n(50)
    View(alaska_flights)

    # Exploratory Data Analysis----------------------------------------------------
    # Plot of sample of points:
    ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
      geom_point()

    # Correlation coefficient:
    alaska_flights %>%
      summarize(correl = cor(dep_delay, arr_delay))

    # Add regression line
    ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE, color = "red")

    # Fit Regression and Study Output with broom Package---------------------------
    # Fit regression
    delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)

    # 1. broom::tidy() regression table with confidence intervals and no p-value stars
    regression_table <- delay_fit %>%
      tidy(conf.int = TRUE)
    regression_table %>%
      kable(digits = 3)

    # 2. broom::augment() for point-by-point values
    regression_points <- delay_fit %>%
      augment() %>%
      select(arr_delay, dep_delay, .fitted, .resid)
    regression_points %>%
      head() %>%
      kable(digits = 3)

    # and for prediction
    new_flights <- tibble(dep_delay = c(25, 30, 15))
    delay_fit %>%
      augment(newdata = new_flights) %>%
      kable()

    # 3. broom::glance() scalar summaries of regression
    regression_summaries <- delay_fit %>%
      glance()
    regression_summaries %>%
      kable(digits = 3)

    # Residual Analysis------------------------------------------------------------
    # Histogram of residuals
    ggplot(data = regression_points, mapping = aes(x = .resid)) +
      geom_histogram(binwidth = 10) +
      geom_vline(xintercept = 0, color = "blue")

    # Residuals against fitted values
    ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
      geom_point() +
      geom_abline(intercept = 0, slope = 0, color = "blue")

    # Normal quantile-quantile plot of residuals
    ggplot(data = regression_points, mapping = aes(sample = .resid)) +
      stat_qq()

    # Preview of Multiple Regression-----------------------------------------------
    alaska_flights <- alaska_flights %>%
      mutate(summer = month == 6 | month == 7 | month == 8)
    ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay, col = summer)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)
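The slides note that broom also tidies output from functions other than lm, such as t.test. As a minimal sketch of that claim, the snippet below runs a Welch two-sample t-test on arrival delays for two carriers and passes the result to tidy(). The choice of carriers (AS and F9) and the object names are assumptions made for illustration and are not part of the chapter code.

```r
library(dplyr)
library(broom)
library(nycflights13)

# Illustrative subset: two carriers, rows with a recorded arrival delay
two_carriers <- flights %>%
  filter(carrier %in% c("AS", "F9"), !is.na(arr_delay))

# t.test() returns an "htest" object whose printed output is hard to reuse
delay_test <- t.test(arr_delay ~ carrier, data = two_carriers)

# tidy() turns it into a one-row data frame (estimate, p.value, conf.low, ...)
delay_test_tidy <- tidy(delay_test)
delay_test_tidy
```

Because the result is an ordinary data frame, it can be piped into kable() or dplyr verbs in the same way as the regression table.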
By July 1st, 2017
knitr::kable() output or View() function.